Abstract

We consider large random trees under Gibbs distributions and prove 
a Large Deviation Principle (LDP) for the distribution of degrees of 
vertices of the tree. The LDP rate function is given explicitly. An
immediate consequence is a Law of Large Numbers for the distribution of
vertex degrees in a large random tree. Our motivation for this study 
comes from the analysis of RNA secondary structures. 

Keywords: random trees, Gibbs distributions, large deviations, RNA 
secondary structure 

1 Introduction 

In this note, we prove a Large Deviation Principle (LDP) for two
models of equilibrium statistical mechanics. In both cases, we consider a
set of trees on $N$ vertices and we define the Gibbs distribution
associated to a certain energy function on that set. The main goal of our
work is to study some typical features of large random trees ($N \to \infty$)
under these distributions.

Here, we provide rigorous proofs for the LDP results announced
in [BH]. As discussed there, our results are motivated by, and have
applications to, the branching of RNA secondary structures. The
trees we consider are a useful abstraction of these biological structures
(see [Heib, Heia] for references on this connection) as well as relatively
straightforward to analyze mathematically. In this simplified model of
RNA folding, we can address the interplay between entropy and energy
in determining a "typical" branching configuration. We find that, due
to the entropy factor, the typical configurations in our model differ
in interesting ways from the arrangements which have minimal energy.

Our mathematical results support and extend recent developments 
in RNA secondary structure prediction (reviewed in [Mat06, MT06])
which broaden the focus beyond simply finding a structure with
minimal free energy. In particular, we prove a Law of Large Numbers for
the degree frequencies in our large random trees, and find that the 
most common trees are not the minimizers of the associated energies. 
This highlights the limitations of prediction methods focused solely on 
energy minimization and the significance of entropy considerations in 
computational structural biology. 

2 Models and results 

In this section we describe our models and state the results. The proofs 
are given in the next section. 

2.1 Labeled trees 

In our first model we fix a natural number $D \ge 2$ and for each $N \in \mathbb{N}$
consider the set $\mathcal{T}_N(D)$ of labeled trees on $N$ vertices such that the
degree of each vertex does not exceed $D$. To define Gibbs distributions
on $\mathcal{T}_N(D)$ we need a function $c : \{1, \ldots, D\} \to \mathbb{R}$ which plays the role
of the energy associated with the degree of a vertex.

To each of the trees $T$ in $\mathcal{T}_N(D)$ we associate the energy

\[ H(T) = \sum_{j=1}^{N} c(d_j(T)) = \sum_{k=1}^{D} c(k)\,\chi_k(T), \tag{1} \]

where $d_j(T)$ denotes the degree of the $j$-th vertex, and $\chi_k(T)$ is the
number of vertices of degree $k$ in $T$. Now the Gibbs probability measure
on $\mathcal{T}_N(D)$ associated with $H$ is given by

\[ P_N\{T\} = \frac{e^{-\beta H(T)}}{Z_N}, \quad T \in \mathcal{T}_N(D), \]

where $\beta \ge 0$ is the inverse temperature parameter and

\[ Z_N = \sum_{T \in \mathcal{T}_N(D)} e^{-\beta H(T)} \tag{2} \]

is the partition function.
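
For very small $N$ the measure $P_N$ can be tabulated by brute force. The sketch below is an illustration only: it assumes trees are enumerated through Prüfer sequences (a standard bijection that the text itself does not use), labels vertices $0, \ldots, N-1$, and uses hypothetical function names. It returns $Z_N$ together with the law of the degree counts $(\chi_1, \ldots, \chi_D)$.

```python
from itertools import product
from collections import Counter
from math import exp

def degree_counts(pruefer, N, D):
    """Degree counts (chi_1..chi_D) of the tree encoded by a Pruefer
    sequence, or None if some vertex has degree larger than D."""
    deg = [1] * N                      # every vertex has degree >= 1
    for v in pruefer:                  # degree = 1 + number of occurrences
        deg[v] += 1
    if max(deg) > D:
        return None
    hist = Counter(deg)
    return tuple(hist.get(k, 0) for k in range(1, D + 1))

def gibbs_degree_law(N, D, c, beta):
    """Brute-force Z_N and the law of chi under P_N.
    c is a dict mapping a degree k to its energy c(k) (an assumed encoding)."""
    weights = Counter()
    for seq in product(range(N), repeat=N - 2):
        chi = degree_counts(seq, N, D)
        if chi is None:
            continue
        H = sum(c[k] * chi[k - 1] for k in range(1, D + 1))
        weights[chi] += exp(-beta * H)
    Z = sum(weights.values())
    return Z, {chi: w / Z for chi, w in weights.items()}
```

For $N = 4$, $D = 3$ and $\beta = 0$ this gives $Z_N = 16$, with the path profile $(2, 2, 0)$ carrying probability $0.75$ and the star profile $(3, 0, 1)$ carrying $0.25$. The enumeration costs $N^{N-2}$ steps, so it is only practical for a handful of vertices.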

Our first result is an LDP for the degree distribution of random
labeled trees under the measures $P_N$ introduced above.

Let us recall that a sequence of probability measures $(\mu_N)_{N \in \mathbb{N}}$ on a
compact metric space $(E, \rho)$ satisfies an LDP with a lower-semicontinuous
nonnegative rate function $I : E \to \mathbb{R}$ if

\[ \limsup_{N \to \infty} \frac{1}{N} \ln \mu_N(C) \le -I(C) \quad \text{for any closed set } C \subset E, \]

and

\[ \liminf_{N \to \infty} \frac{1}{N} \ln \mu_N(O) \ge -I(O) \quad \text{for any open set } O \subset E, \]

where for $U \subset E$,

\[ I(U) = \inf_{p \in U} I(p). \]

See [Ell06, Section II.3] or [DZ98, Section 1.2] for further details.

Informally, an LDP means that if we consider random variables $X_N$
with distribution $\mu_N$, then for all $p$ and large $N$ we have

\[ \mathsf{P}\{X_N \approx p\} \approx e^{-N I(p)}. \]

In particular, if the minimal value is attained by $I$ at a unique point $p^*$
then for any neighborhood $O$ of $p^*$, $\mu_N(O^c)$ decays exponentially in $N$.
This can be restated as a Law of Large Numbers with exponential
convergence in probability to the limit point $p^*$.

We can view $(\chi_1, \ldots, \chi_D)$ as a random vector defined on the
probability space $\mathcal{T}_N(D)$ equipped with the Gibbs measure $P_N$. We would
like to study the frequencies of vertex degrees, so for each $N$ we
introduce a probability measure $\nu_N$ on $[0, 1]^D$ defined as the distribution of
the random vector $\frac{1}{N}(\chi_1, \ldots, \chi_D)$ under $P_N$. It is natural to formulate
an LDP for $\nu_N$ on the set

\[ \mathcal{M} = \Bigl\{ p \in [0,1]^D : \ \sum_{k=1}^{D} p_k = 1, \ \sum_{k=1}^{D} k\,p_k = 2 \Bigr\} \]

equipped with Euclidean distance. (Notice that $\mathcal{M}$ is nonempty if
$D \ge 2$.) Though the random vector $\frac{1}{N}(\chi_1, \ldots, \chi_D)$ does not belong to
$\mathcal{M}$, it is asymptotically close to $\mathcal{M}$:

\[ \sum_{k=1}^{D} \frac{\chi_k}{N} = 1, \qquad \sum_{k=1}^{D} k\,\frac{\chi_k}{N} = \frac{2N-2}{N} = 2 - \frac{2}{N}. \]

So instead of formulating an LDP for the sequence of random vectors
$\frac{1}{N}(\chi_1, \ldots, \chi_D)$, we shall formulate and prove an LDP for a sequence
of random vectors that is close to it and belongs to $\mathcal{M}$.

To define the rate function, we introduce $J : \mathcal{M} \to \mathbb{R}$ via

\[ J(p) = -h(p) + \beta E(p) + G(p), \]

where

\[ h(p) = -\sum_{k=1}^{D} p_k \ln p_k \]

is the entropy of the probability vector $p = (p_1, \ldots, p_D)$,

\[ E(p) = \sum_{k=1}^{D} c(k)\,p_k \]

is the energy associated with $p$, and $G(p)$ is defined by

\[ G(p) = \sum_{k=1}^{D} p_k \ln((k-1)!). \tag{3} \]

In Section 3 we shall see that the function $G$ appears naturally in the
analysis of random trees.
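
The three terms of $J$ are simple to evaluate numerically. The following sketch is only illustrative: it assumes $p$ and $c$ are passed as Python lists indexed by degree $k = 1, \ldots, D$, uses the convention $0 \ln 0 = 0$, and computes $\ln((k-1)!)$ as `lgamma(k)`. The function names are not from the text.

```python
from math import log, lgamma

def entropy(p):
    """h(p) = -sum_k p_k ln p_k, with the convention 0 ln 0 = 0."""
    return -sum(x * log(x) for x in p if x > 0)

def energy(p, c):
    """E(p) = sum_k c(k) p_k, where c[k-1] holds c(k)."""
    return sum(ck * pk for ck, pk in zip(c, p))

def G(p):
    """G(p) = sum_k p_k ln((k-1)!); note lgamma(k) = ln((k-1)!)."""
    return sum(pk * lgamma(k) for k, pk in enumerate(p, start=1))

def J(p, c, beta):
    """J(p) = -h(p) + beta E(p) + G(p)."""
    return -entropy(p) + beta * energy(p, c) + G(p)
```

For instance, with $D = 3$ and $c \equiv 0$, the point $p = (1/2, 0, 1/2)$ gives $J(p) = -\ln 2 + \tfrac12 \ln 2 = -\tfrac12 \ln 2$, while $p = (0, 1, 0)$ gives $J(p) = 0$.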

The function $J$ is strictly convex and continuous on $\mathcal{M}$.
Therefore, it attains its minimal value at a uniquely defined point
$p^* \in \mathcal{M}$. Consider now

\[ I(p) = J(p) - J(p^*). \tag{4} \]

It is easy to see that $I$ is bounded, convex and continuous on $\mathcal{M}$.

For a measure $Q$ on $[0, 1]^D \times \mathcal{M}$ we define $Q^{(1)}$ and $Q^{(2)}$ as the
marginal distributions of $Q$ on $[0, 1]^D$ and $\mathcal{M}$ respectively.

Theorem 1 There is a sequence of probability measures $(Q_N)_{N \in \mathbb{N}}$
defined on $[0, 1]^D \times \mathcal{M}$ with the following properties.

1. For each $N$, we have $Q_N^{(1)} = \nu_N$.

2. For each $N$,

\[ Q_N\Bigl\{ (x, y) \in [0,1]^D \times \mathcal{M} : \ \sum_{k=1}^{D} |x_k - y_k| > \frac{2}{N} \Bigr\} = 0. \]

3. The sequence $(Q_N^{(2)})_{N \in \mathbb{N}}$ satisfies an LDP on $\mathcal{M}$ with the rate
function $I$ defined in (4).

Remark 1 This theorem says that although the random vector $\chi/N$
does not belong to $\mathcal{M}$, one can find another random vector that is, on
the one hand, very close to $\chi/N$ and on the other hand belongs to $\mathcal{M}$
and satisfies the LDP.

Theorem 1 immediately implies the following Law of Large Numbers:
Corollary 1 As $N \to \infty$,

\[ \Bigl( \frac{\chi_1}{N}, \ldots, \frac{\chi_D}{N} \Bigr) \to p^* \]

in probability.

Remark 2 The statements above show that with high probability the
degree frequencies are close to $p^*$. Note that in most cases the minimum
of the energy $E$ on $\mathcal{M}$ is not attained at $p^*$.
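
The minimizer $p^*$ can be approximated numerically. The sketch below rests on an assumption that is not spelled out in the text: for $D \ge 3$, writing the Lagrange conditions for minimizing $J$ under the two linear constraints suggests $p^*_k \propto e^{-\beta c(k)}\, t^k / (k-1)!$ for some $t > 0$, with $t$ fixed by $\sum_k k\,p_k = 2$. Since the mean degree of this family increases with $t$, a bisection in $\ln t$ suffices. All names and the search bracket are illustrative choices.

```python
from math import exp, lgamma

def tilted(log_t, c, beta):
    """Probability vector with p_k proportional to
    exp(-beta c(k)) t^k / (k-1)!, for k = 1..D (c[k-1] holds c(k))."""
    D = len(c)
    logw = [-beta * c[k - 1] - lgamma(k) + k * log_t for k in range(1, D + 1)]
    m = max(logw)                       # subtract the max for stability
    w = [exp(x - m) for x in logw]
    s = sum(w)
    return [x / s for x in w]

def mean_degree(p):
    """sum_k k p_k for p indexed by k = 1..D."""
    return sum(k * pk for k, pk in enumerate(p, start=1))

def p_star(c, beta, target=2.0, lo=-50.0, hi=50.0, iters=200):
    """Bisection on ln t so that the mean degree equals `target`.
    Assumes D >= 3 so that a solution with finite t exists."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_degree(tilted(mid, c, beta)) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi), c, beta)
```

With $D = 3$ and $c \equiv 0$ the equation for $t$ reduces to $t^2 = 2$, so $p^* = \bigl(\tfrac{2 - \sqrt2}{2}, \sqrt2 - 1, \tfrac{2 - \sqrt2}{2}\bigr) \approx (0.293, 0.414, 0.293)$. With $c = (1, 0, 1)$ the energy $E$ is minimized on $\mathcal{M}$ at $(0, 1, 0)$, yet `p_star([1.0, 0.0, 1.0], 1.0)` still puts positive mass on degrees 1 and 3, which illustrates Remark 2.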

2.2 Plane trees 

We now consider a similar model for plane trees (sometimes also called
ordered trees). These are rooted trees such that subtrees at any vertex
are linearly ordered, see e.g. [Sta99]. We redefine the notation
introduced in the previous section. We fix a number $D \in \mathbb{N}$ and for each
$N \in \mathbb{N}$ let $\mathcal{T}_N(D)$ denote the set of ordered trees on $N$ vertices
such that the branching (i.e. the number of children) at each vertex
does not exceed $D$. The energy of each vertex depends only on its
branching and is given by a function $c : \{0, 1, \ldots, D\} \to \mathbb{R}$. With each
tree $T \in \mathcal{T}_N(D)$ we associate the energy

\[ H(T) = \sum_{k=0}^{D} c(k)\,\chi_k(T), \]

where $\chi_k(T)$ is now the number of vertices with $k$ children in $T$. The
Gibbs probability measure on $\mathcal{T}_N(D)$ associated with $H$ is given by

\[ P_N\{T\} = \frac{e^{-\beta H(T)}}{Z_N}, \quad T \in \mathcal{T}_N(D), \]

where $\beta \ge 0$ is the inverse temperature and $Z_N$ is a normalizing
constant.

For each $N$, we introduce a probability measure $\nu_N$ on $[0, 1]^{D+1}$
defined as the distribution of the random vector $\frac{1}{N}(\chi_0, \chi_1, \ldots, \chi_D)$
under $P_N$.

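We redefine $\mathcal{M}$ to be

\[ \mathcal{M} = \Bigl\{ p \in [0,1]^{D+1} : \ \sum_{k=0}^{D} p_k = 1, \ \sum_{k=0}^{D} k\,p_k = 1 \Bigr\}. \tag{5} \]

To formulate an LDP for this model we define $J : \mathcal{M} \to \mathbb{R}$ via

\[ J(p) = -h(p) + \beta E(p), \]

where

\[ h(p) = -\sum_{k=0}^{D} p_k \ln p_k \]

is the entropy of the probability vector $p = (p_0, p_1, \ldots, p_D)$, and

\[ E(p) = \sum_{k=0}^{D} c(k)\,p_k \]

is the energy associated with $p \in \mathcal{M}$.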

As in the first model, the function $J$ attains its minimum on $\mathcal{M}$ at
a unique point that we denote by $p^*$. Let

\[ I(p) = J(p) - J(p^*). \tag{6} \]

This function will play the role of the rate function. Notice that in the
case of plane trees it does not involve the function $G(p)$ that appeared
in the construction of the rate function for the case of labeled trees.

For a measure $Q$ on $[0, 1]^{D+1} \times \mathcal{M}$ we define $Q^{(1)}$ and $Q^{(2)}$ as the
marginal distributions of $Q$ on $[0, 1]^{D+1}$ and $\mathcal{M}$ respectively.

Theorem 2 There is a sequence of probability measures $(Q_N)_{N \in \mathbb{N}}$
defined on $[0, 1]^{D+1} \times \mathcal{M}$ with the following properties.

1. For each $N$, we have $Q_N^{(1)} = \nu_N$.

2. For each $N$,

\[ Q_N\Bigl\{ (x, y) \in [0,1]^{D+1} \times \mathcal{M} : \ \sum_{k=0}^{D} |x_k - y_k| > \frac{2}{N} \Bigr\} = 0. \]

3. The sequence $(Q_N^{(2)})_{N \in \mathbb{N}}$ satisfies an LDP on $\mathcal{M}$ with the rate
function $I$ defined in (6).

An immediate consequence is the following Law of Large Numbers:

Corollary 2 As $N \to \infty$,

\[ \Bigl( \frac{\chi_0}{N}, \frac{\chi_1}{N}, \ldots, \frac{\chi_D}{N} \Bigr) \to p^* \]

in probability.



3 Proofs 

We start with the proof of Theorem 1, adopting the notation and
setting for labeled trees from Section 2.1.

The crucial fact for our analysis is the following formula for the
number of trees on $N$ vertices with degrees $d_1, \ldots, d_N$:

\[ \binom{N-2}{d_1 - 1,\ d_2 - 1,\ \ldots,\ d_N - 1} \]

if $d_1 + \ldots + d_N = 2N-2$, and $0$ otherwise, see [Moo70, Formula (2.1)].
Therefore, the total number of $N$-trees $T$ with $\chi(T) = (n_1, \ldots, n_D)$ is
given by

\[ \frac{(N-2)!}{\prod_{k=1}^{D} ((k-1)!)^{n_k}} \binom{N}{n_1,\ n_2,\ \ldots,\ n_D} = \frac{(N-2)!}{(2!)^{n_3} \cdots ((D-1)!)^{n_D}}\, C(N, n), \]

where $C(N, n) = \binom{N}{n_1, \ldots, n_D}$. All these trees $T$ have the same energy
$H(T)$, so that

\[ P_N\Bigl\{ \frac{\chi(T)}{N} = \frac{n}{N} \Bigr\} = \frac{(N-2)!\; e^{-N F(n/N)}\, C(N, n)}{Z_N}, \tag{7} \]

where $Z_N$ is defined in (2), and we notice that

\[ Z_N = (N-2)! \sum_{\substack{n_1 + \ldots + n_D = N \\ n_1 + 2 n_2 + \ldots + D n_D = 2N-2}} e^{-N F(n/N)}\, C(N, n) \]

and

\[ F(p) = \beta E(p) + G(p) = \beta \sum_{k=1}^{D} c(k)\,p_k + \sum_{k=1}^{D} p_k \ln((k-1)!), \quad p \in [0, 1]^D, \]

with $G(p)$ defined in (3).
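
The count of labeled trees with a given degree profile can be checked by brute force for small $N$. The sketch below is a sanity check under stated assumptions, not part of the proof: it enumerates all trees through Prüfer sequences (a standard bijection not used in the text), takes $D = N - 1$ so that no degree cap applies, and compares the tallies with $(N-2)!\,C(N,n) / \prod_k ((k-1)!)^{n_k}$. Function names are illustrative.

```python
from itertools import product
from collections import Counter
from math import factorial

def profile_counts_bruteforce(N):
    """Tally labeled trees on N vertices by degree profile (n_1..n_{N-1})."""
    counts = Counter()
    for seq in product(range(N), repeat=N - 2):
        deg = [1] * N
        for v in seq:                   # degree = 1 + occurrences in sequence
            deg[v] += 1
        hist = Counter(deg)
        counts[tuple(hist.get(k, 0) for k in range(1, N))] += 1
    return counts

def profile_count_formula(n):
    """(N-2)! / prod_k ((k-1)!)^{n_k} * C(N, n), for a valid profile n."""
    N = sum(n)
    denom = 1
    multinom = factorial(N)             # becomes C(N, n) after the loop
    for k, nk in enumerate(n, start=1):
        denom *= factorial(k - 1) ** nk
        multinom //= factorial(nk)
    return factorial(N - 2) * multinom // denom
```

For $N = 4$ the two profiles are $(2, 2, 0)$ with $12$ trees (paths) and $(3, 0, 1)$ with $4$ trees (stars), in agreement with Cayley's total of $4^2 = 16$.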

Our plan is to use the LDP for the multinomial distribution that
manifests itself in the coefficients $C(N, n)$ in the r.h.s. of (7), and then apply
a version of Varadhan's lemma for Gibbs transformations via the
exponential factor $e^{-N F(\cdot)}$.

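We start with the family of distributions $\mu_N$ on $\mathcal{M}$ defined by

\[ \mu_N\{p\} = \begin{cases} C(N, n) / \bar{Z}_N, & p = n/N \in \mathcal{M}, \\ 0, & \text{otherwise}, \end{cases} \]

where

\[ \bar{Z}_N = \sum_{n :\ n/N \in \mathcal{M}} C(N, n). \]

Lemma 1 The sequence of measures $(\mu_N)_{N \in \mathbb{N}}$ satisfies an LDP on
$\mathcal{M}$ with rate function $I_1$ defined by

\[ I_1(p) = h^* - h(p), \]

where

\[ h^* = \sup_{p \in \mathcal{M}} h(p). \]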
Proof. The proof of this lemma literally repeats that of Sanov's
theorem (an LDP for the multinomial distribution, see [DZ98, Theorem
2.1.10]). It is based on the formula

\[ \frac{1}{N} \ln C(N, n) = -\sum_{k=1}^{D} \frac{n_k}{N} \ln \frac{n_k}{N} + O\Bigl( \frac{\ln N}{N} \Bigr), \]

which holds true uniformly in $n$, see e.g. [Ell06, Lemma I.4.4].
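
The quality of this approximation is easy to probe numerically. The sketch below assumes nothing beyond the formula itself: it computes $\ln C(N, n)$ through `math.lgamma` and reports the gap $\frac{1}{N}\ln C(N,n) - h(n/N)$. Function names are illustrative.

```python
from math import lgamma, log

def log_multinomial(n):
    """ln C(N, n) = ln( N! / (n_1! ... n_D!) ), computed via lgamma."""
    N = sum(n)
    return lgamma(N + 1) - sum(lgamma(nk + 1) for nk in n)

def entropy(p):
    """h(p) = -sum_k p_k ln p_k, with the convention 0 ln 0 = 0."""
    return -sum(x * log(x) for x in p if x > 0)

def stirling_gap(n):
    """(1/N) ln C(N, n) - h(n/N); expected to be of order (ln N)/N."""
    N = sum(n)
    return log_multinomial(n) / N - entropy([nk / N for nk in n])
```

Evaluating `stirling_gap` on $(300, 400, 300)$ and then on $(3000, 4000, 3000)$ shows the gap shrinking as $N$ grows, consistent with an error term of order $(\ln N)/N$.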
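
Let us now introduce the Gibbsian weight

\[ g_N(p) = e^{-N F(p)} \]

and a new family of measures $\lambda_N$ on $\mathcal{M}$:

\[ \lambda_N\Bigl\{ \frac{n}{N} \Bigr\} = \frac{g_N(n/N)\, C(N, n)}{\tilde{Z}_N} \quad \text{for } \frac{n}{N} \in \mathcal{M}, \]

where

\[ \tilde{Z}_N = \sum_{n :\ n/N \in \mathcal{M}} g_N(n/N)\, C(N, n). \]

In other words,

\[ \lambda_N(dp) = \frac{g_N(p)\, \mu_N(dp)}{\int_{\mathcal{M}} g_N(q)\, \mu_N(dq)}. \]

Let us also denote $J_1(p) = F(p) + I_1(p)$ and $J_{1,*} = \inf_{p \in \mathcal{M}} J_1(p)$.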

Lemma 2 The sequence of measures $(\lambda_N)_{N \in \mathbb{N}}$ satisfies an LDP on
$\mathcal{M}$ with rate function $I_2$ given by $I_2(p) = J_1(p) - J_{1,*}$.

Proof. This lemma follows directly from a variant of Varadhan's lemma
for Gibbs transformations (Theorem II.7.2 in [Ell06]).

Remark 3 Notice that $I_2(p) = I(p)$ for all $p \in \mathcal{M}$. So we have proven
the desired LDP on $\mathcal{M}$ for $(\lambda_N)_{N \in \mathbb{N}}$. In order to prove Theorem 1
we shall have to compare $\lambda_N$ to $\nu_N$.

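Proof of Theorem 1. We consider the distribution $P_N$ on $\mathcal{T}_N(D)$, so
that $\chi/N$ is distributed according to $\nu_N$. For each $x$ that belongs to the
support of $\nu_N$ we introduce the set

\[ R(x) = \Bigl\{ y \in \mathcal{M} : \ N y_k \in \mathbb{Z},\ k = 1, \ldots, D, \ \text{and} \ \sum_{k=1}^{D} |x_k - y_k| = \frac{2}{N} \Bigr\}. \]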
fe=i 
It is easy to see that $1 \le |R(x)| \le D$ for all $x$, where $|R|$ denotes the
number of elements in $R$.

Let us now define the measure $Q_N$. We start with the random vector
$\chi/N$, and define a random vector $Y$ so that, given $\chi/N$, the conditional
distribution of $Y$ is uniform on $R(\chi/N)$. Now $Q_N$ denotes the joint
distribution of $\chi/N$ and $Y$. Clearly, the first two desired properties
of $Q_N$ hold true by the definition of $Q_N$. The third one follows from
Lemma 2 and the following statement claiming that the measures $Q_N^{(2)}$
and $\lambda_N$ differ by a subexponential factor, thus obeying an LDP with
the same rate function:

Lemma 3 There is a constant $C > 0$ such that for all $N$ and all sets
$U \subset \mathcal{M}$,

\[ \frac{1}{C N^4} \le \frac{Q_N^{(2)}(U)}{\lambda_N(U)} \le C N^4. \]

This lemma is a straightforward consequence of the following fact:
there is a constant $K$ such that if $|n_1 - n_1'| + \ldots + |n_D - n_D'| = 2$ then

\[ \frac{1}{K N^2} \le \frac{C(N, n)}{C(N, n')} \le K N^2. \]
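
One way to see this bound, sketched here under the assumption that $n$ and $n'$ both sum to $N$ (so that $n'$ arises from $n$ by moving a single vertex from class $j$ to class $k$), is the direct computation below. It is an illustration, not necessarily the authors' argument.

```latex
% Assumption: n'_j = n_j - 1, n'_k = n_k + 1, and n'_i = n_i for all other i.
\[
  \frac{C(N, n)}{C(N, n')}
  = \frac{n'_j!\; n'_k!}{n_j!\; n_k!}
  = \frac{(n_j - 1)!\,(n_k + 1)!}{n_j!\; n_k!}
  = \frac{n_k + 1}{n_j}.
\]
% Since 1 <= n_j <= N and 1 <= n_k + 1 <= N, the ratio lies in [1/N, N],
% which is well within the stated range [1/(K N^2), K N^2].
```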



The proof of Theorem 2 is essentially the same. It is based on the
following expression for the number of ordered trees of order $N$ with
$n_k$ nodes having $k$ children:

\[ \frac{1}{N} \binom{N}{n_0,\ n_1,\ n_2,\ \ldots,\ n_D} \]

if $n_1 + 2 n_2 + \ldots + D n_D = N - 1$, and $0$ otherwise (see e.g. Theorem 5.3.10 in
[Sta99]).
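
This count can also be checked by brute force for small $N$. The sketch below relies on an encoding that is an assumption of the example, not something the text uses: a plane tree is represented by its preorder list of child counts, and a list is a valid code exactly when the number of pending vertices stays positive until the last entry and ends at zero. Function names are illustrative.

```python
from itertools import product
from collections import Counter
from math import factorial

def is_preorder_code(d):
    """True if d is the preorder list of child counts of a plane tree."""
    pending = 1                         # vertices still waiting to be placed
    for x in d:
        if pending == 0:                # tree already complete: extra entries
            return False
        pending += x - 1                # place one vertex, open x new slots
    return pending == 0

def plane_profile_counts(N):
    """Tally plane trees on N vertices by profile (n_0..n_{N-1})."""
    counts = Counter()
    for d in product(range(N), repeat=N):
        if is_preorder_code(d):
            hist = Counter(d)
            counts[tuple(hist.get(k, 0) for k in range(N))] += 1
    return counts

def plane_profile_formula(n):
    """(1/N) * C(N, n) for a profile n = (n_0, n_1, ...) of a plane tree."""
    N = sum(n)
    multinom = factorial(N)
    for nk in n:
        multinom //= factorial(nk)
    return multinom // N
```

For $N = 4$ the profiles $(1, 3, 0, 0)$, $(2, 1, 1, 0)$ and $(3, 0, 0, 1)$ are realized by $1$, $3$ and $1$ plane trees respectively, adding up to the Catalan number $5$.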